63 research outputs found

    A Fast Cache-Oblivious Mesh Layout with Theoretical Guarantees

    One important bottleneck when visualizing large data sets is the data transfer between processor and memory. Cache-aware (CA) and cache-oblivious (CO) algorithms take the memory hierarchy into account to achieve cache efficiency. CO approaches have the advantage of adapting to unknown and varying memory hierarchies. Recent CA and CO algorithms developed for 3D mesh layouts significantly improve on the performance of previous approaches. However, these algorithms are based on heuristics. In this paper we propose a new CO algorithm for meshes that has both low theoretical complexity and proven quality. We guarantee that a coherent traversal of an N-size mesh in dimension d will induce fewer than N/B + O(N/M^{1/d}) cache misses, where B and M are the block size and the cache size. We compare our layout with previous ones on several 3D meshes.

    Improving Reactivity to I/O Events in Multithreaded Environments Using a Uniform, Scheduler-Centric API

    Reactivity to I/O events is a crucial factor in the performance of modern multithreaded distributed systems. In our scheduler-centric approach, an application detects I/O events by requesting a service from a detection server through a simple, uniform API. We show that the thread scheduler is a good choice for this detection server. This approach simplifies application programming, significantly improves performance, and provides much tighter control over reactivity.

    Binary Mesh Partitioning for Cache-Efficient Processing

    One important bottleneck when visualizing large data sets is the data transfer between processor and memory. Cache-aware (CA) and cache-oblivious (CO) algorithms take the memory hierarchy into account to achieve cache efficiency. CO approaches have the advantage of adapting to unknown and varying memory hierarchies. Recent CA and CO algorithms developed for 3D mesh layouts significantly improve on the performance of previous approaches, but lack theoretical performance guarantees. In this report we present an O(N log N) algorithm to compute a CO layout for unstructured meshes. We prove that a coherent traversal of an N-size mesh in dimension d will induce fewer than N/B + O(N/M^{1/d}) cache misses, where B and M are the block size and the cache size. Experiments show that our layout computation is faster and uses significantly less memory than the best known CO algorithm. Performance is comparable to that algorithm for the access patterns of classical visualization algorithms, and better if the access pattern is adapted to the binary mesh partitioning produced by the algorithm. We also show that cache-oblivious approaches lead to significant performance increases on recent GPU architectures.
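
    To get a feel for the bound quoted in this abstract, the short sketch below evaluates N/B + c·N/M^{1/d} for given sizes. It is only an illustration: the function name, the `constant` parameter standing in for the unspecified constant hidden by the O(·), and the choice to measure N, B and M in mesh elements are all assumptions, not part of the report.

```python
def cache_miss_bound(n, block_size, cache_size, dim, constant=1.0):
    """Evaluate n/B + constant * n / M**(1/dim).

    `constant` replaces the unknown factor hidden in the O(.) term;
    its default of 1.0 is an arbitrary assumption for illustration.
    All sizes are assumed to be expressed in mesh elements.
    """
    if n <= 0 or block_size <= 0 or cache_size <= 0 or dim <= 0:
        raise ValueError("all arguments must be positive")
    scan_term = n / block_size                      # cost of reading every block once
    boundary_term = constant * n / cache_size ** (1.0 / dim)
    return scan_term + boundary_term


if __name__ == "__main__":
    # Hypothetical sizes: 65,536 elements, 16 per block, cache of 4,096 elements.
    for d in (2, 3):
        print(d, cache_miss_bound(65_536, 16, 4_096, d))
```

    With these sample sizes the first term (N/B) dominates in both dimensions, and the second term grows as d increases because M^{1/d} shrinks.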

    X-Kaapi C programming interface

    This report defines the X-Kaapi C programming interface.

    The X-Kaapi's Application Programming Interface. Part I: Data Flow Programming

    In this report, we present X-Kaapi's programming model. An X-Kaapi parallel program is a sequential C or C++ program annotated with #pragma compiler directives that describe its tasks. A dedicated source-to-source compiler translates the X-Kaapi directives into runtime calls. The generated code lets the X-Kaapi runtime extract the data-flow graph at execution time, including for recursive programs, whose tasks are generated recursively.

    An Efficient Multi-level Trace Toolkit for Multi-threaded Applications

    Observing and understanding the behavior and performance of a multithreaded application is nontrivial, especially within a complex multithreaded environment such as a multilevel thread scheduler. In this report, we present a trace toolkit that allows a programmer to precisely analyze the behavior of a multithreaded application. An application's run generates several traces that are merged and analyzed offline. The resulting super-trace contains not only classical information, such as the number of elapsed CPU cycles per function, but also details about thread scheduling at multiple levels.

    IV Grid Plugtests: composing dedicated tools to run an application efficiently on Grid'5000

    Efficiently exploiting the resources of the whole of Grid'5000 with a single application requires solving several issues: 1) resource reservation; 2) deployment of the application's processes; 3) scheduling of the application's tasks. For the IV Grid Plugtests, we used a dedicated tool for each issue. The N-Queens contest rules imposed ProActive for resource reservation (issue 1). Issue 2 was solved using TakTuk, which allows a large set of remote nodes to be deployed. Deployed nodes themselves take part in the deployment through an adaptive algorithm, which makes it very efficient. For issue 3, we wrote our application with the Athapascan API, whose model is based on the concepts of tasks and shared data. The application is described as a data-flow graph using the Shared and Fork keywords. This high-level abstraction of the hardware gives us an efficient execution with the Kaapi runtime engine, which uses a work-stealing scheduling algorithm to balance the workload across all the distributed processes.
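
    The work-stealing idea mentioned at the end of this abstract can be sketched as a small round-based simulation. This is not Kaapi's or Athapascan's implementation, and none of the names below come from those libraries: `run_work_stealing`, the random victim selection, and the single-process round loop are all assumptions chosen to keep the example short.

```python
from collections import deque
import random


def run_work_stealing(task_lists, seed=0):
    """Simulate work stealing in rounds and return what each worker ran.

    Each worker owns a deque. A busy worker takes the newest task from
    its own deque; an idle worker steals the oldest task from a randomly
    chosen non-empty victim. (Illustrative model only.)
    """
    rng = random.Random(seed)
    queues = [deque(tasks) for tasks in task_lists]
    executed = [[] for _ in queues]
    while any(queues):
        for worker, queue in enumerate(queues):
            if queue:
                executed[worker].append(queue.pop())        # own work, newest first
            else:
                victims = [v for v, q in enumerate(queues) if q]
                if victims:
                    victim = rng.choice(victims)
                    executed[worker].append(queues[victim].popleft())  # steal oldest
    return executed


if __name__ == "__main__":
    # All twelve tasks start on worker 0; the other three workers start idle.
    result = run_work_stealing([list(range(12)), [], [], []])
    for worker, tasks in enumerate(result):
        print(worker, tasks)
```

    Even though every task starts on one worker, the idle workers pick up a share of the load as soon as they run out of local work, which is the balancing effect the abstract refers to.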

    Detecção de Anomalias de Desempenho em Aplicações de Alto Desempenho baseadas em Tarefas em Clusters Híbridos

    Programming paradigms in High-Performance Computing have been shifting towards task-based models, which are able to adapt more readily to heterogeneous and scalable supercomputers. Detecting performance anomalies in such environments is particularly difficult, since it must take into account architecture heterogeneity, variability, and the ability to obtain trusted measurements. This work presents a case study on the detection of anomalies in the execution of the well-known tiled dense Cholesky factorization developed with StarPU. Our experiments were conducted on a variety of hybrid multi-node platforms to demonstrate how we are able to detect and highlight performance anomalies.

    Adaptive and Hybrid Algorithms: classification and illustration on triangular system solving

    In this article we propose a classification of the different notions of hybridization and a generic framework for the automatic hybridization of algorithms. We then detail the results of this generic framework on the example of the parallel solution of multiple linear systems.